A new heuristic based on vertex invariants is developed to rapidly distinguish non-isomorphic
graphs to a desired level of accuracy. The method is applied to sample subgraphs from an E. coli
protein interaction network, and as a probe for discovery of extended motifs. The network's structure
is described using statistical properties of its N-node subgraphs for N ≤ 14. The Zipf plots for
subgraph occurrences are robust power laws that do not change when rewiring the network while
fixing the degree sequence, although the specific subgraphs may exchange ranks. However, the
exponent depends on N. The study of larger subgraphs highlights some striking patterns for various
N. Motifs, or connected pieces that are over-abundant in the ensemble of subgraphs, have more
edges, for a given number of nodes, than antimotifs and generally display a bipartite structure
or tend towards a complete graph. In contrast, antimotifs, which are under-abundant connected
pieces, are mostly trees or contain at most a single, small loop. The extension to directed graphs is
straightforward.
I. INTRODUCTION

The recent surge of interest in complex networks has often targeted
general features of organisation. A number of common properties have
been observed, including the so-called small world effect, fat tails in
the distribution of the node degree (the "scale-free" network), as well
as clustering. Although the last two attributes are statistical
properties of the local network structure, networks that share these
features may nonetheless exhibit totally different specific local
structures. Certain connected subgraphs with three or four nodes, termed
"motifs", turn out to be significantly over-abundant in real networks
when compared to null models. These null models are typically randomised
networks where the smaller scale structure (e.g. node degree) is
determined by the original network. It is believed that networks with
similar functions - for example, forward logic chips and neural
networks - display the same motifs. A growing body of evidence indicates
that particular motifs perform specific functions in gene transcription
networks. In addition, proteins within motifs are more conserved across
species than proteins that do not form part of such units.

Motifs and antimotifs, which are significantly under-abundant connected
subgraphs, may also be useful in classifying networks and comparing
real-life situations to theoretical models. Milo et al. explored
significance profiles: normalised Z-scores for particular connected
subgraphs. They claim to find "superfamilies" of networks displaying
similar profiles. In a similar vein, Middendorf et al. used exhaustive
subgraph enumeration of networks generated by different theoretical
models as training data for a machine learning algorithm, and developed
a discriminative classifier subsequently able to identify new networks
with success.

However, all of these approaches have been handicapped by the small
size of connected subgraphs. This limits the scale where features of
organisation in networks can be discovered. In most cases, connected
subgraphs with at most four nodes are considered. Middendorf et al.
searched for two different categories of subgraphs: graphs which could
be generated by a random walk of length less than or equal to eight, and
graphs with up to seven links - to achieve slightly larger subgraphs.
Ziv et al. [27] analysed statistically significant measures that can be
calculated directly from the adjacency matrix. These measures are
related to subgraphs but lack a one-to-one correspondence. Hence the
possibility of insight into the function of organised structure at
different scales or the systematic discovery of larger scale structures
is - from our point of view - lost.

The existing size limitation for motif discovery leaves some
interesting questions unanswered. Do motifs appear independently, or do
they combine to form larger organised structures [27, 28, 29] that are
overwhelmingly represented in the real network compared to an
appropriate null model? If so, what do these extended structures look
like? What properties of the network's ensemble of N-node subgraphs
distinguish it from null models or from other networks? Are collections
of nodes that participate in motifs of larger sizes also more likely to
be related to function and/or conserved through evolutionary history?
Kashtan et al. made some progress in this direction by considering
specific generalisations of three and four node motifs [30]. They found
that networks sharing a particular three node motif favoured different
generalisations of that motif, suggesting that larger structures need to
be considered to fully understand how the network is organised. However,
this work relied on a priori assumptions about possible generalisations
to larger motifs. Searches were tailored to particular subgraphs. A more
general analysis is known to be computationally difficult [31].



A. Problems in Finding Extended Structures 

There are at least three main problems. The first is that the time
required for exhaustive enumeration of subgraphs increases rapidly with
subgraph size, particularly for large networks. This can be solved by
sampling: Kashtan et al. [33] showed that quite small samples could be
sufficient to identify motifs with up to seven nodes. However, their
method requires the calculation of weights in order to achieve uniform
sampling. The calculation of these weights increases in difficulty, with
combinatorial factors, as the connected subgraph size increases. We
achieve uniform sampling automatically by picking nodes at random from
the network - at the expense of sampling both connected and
disconnected subgraphs.
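The node-picking step can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the study: the function name `sample_induced_subgraph`, the adjacency-set representation and the toy network are assumptions made here for the example.

```python
import random

def sample_induced_subgraph(adj, n_nodes, rng=random):
    """Pick n_nodes distinct nodes uniformly at random and return the
    subgraph they induce (it may be connected or disconnected)."""
    nodes = rng.sample(sorted(adj), n_nodes)
    chosen = set(nodes)
    # Keep only the links whose two endpoints were both picked.
    return {v: adj[v] & chosen for v in nodes}

# Hypothetical toy network: node -> set of neighbours.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
sub = sample_induced_subgraph(toy, 3, random.Random(1))
```

Because every set of N distinct nodes is equally likely, no sampling weights are needed; the price is that many sampled subgraphs are disconnected.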

The second problem is to determine appropriate null model(s) and
significance. In the standard null model the degree of every node is not
allowed to change, so the single node properties are fixed. Such an
ensemble can be obtained using a Sequential Monte Carlo method called
"rewiring". Statistically significant deviations from that background
come, by definition, from node-node correlations. Extending this
argument, when Milo et al. search for 4-node motifs they also fix the
actual number of each kind of 3-node subgraph in their null model.
However, as in Ref. [30], here we use only the ensemble of fixed degree
sequence as a null model to test for significance. Explicitly fixing the
occurrence of (N - 1)-node subgraphs is computationally intractable for
larger N. There are not only linear constraints between different
subgraphs arising from conservation laws associated with rewiring (see
Ref. [25] and Section IV C) but also non-linear correlations caused, in
part, by the form of the null model.

The third difficulty lies in distinguishing non-isomorphic subgraphs.
This is the well-known and notoriously difficult "graph isomorphism
problem". The number of possible graphs grows faster than exponentially
with N. Several algorithms [37, 38, 39, 40, 41] are available, but most
of these are configured to make a comparison for isomorphism between two
graphs. Comparing each new subgraph pairwise to all subgraphs already
identified would be far too time-consuming in this context. Some
existing programs can be altered to provide sets of labels to identify
particular graphs. They tend to be optimised for large graphs (hundreds
of nodes), and appear to us to be unsuitable for the type of search
required for discovery of organisation at larger scales than three or
four nodes.

At this point in time, discovery of larger scale organisation does not
require particularly large subgraphs. Ten to fifteen nodes would already
be a significant step forward, and entails a new set of problems and
types of behaviours as discussed later. Subgraphs do, however, need to
be classified quickly if a method is to be practical. We present a new
heuristic that assigns a set of labels to each subgraph as it is
sampled, so that isomorphic graphs are guaranteed to have the same
label(s), but (most) non-isomorphic graphs have different labels. The
accuracy of the method depends on the number of labels used - at the
expense of increased computational effort. We test the heuristic by
comparing with exact enumeration of all non-isomorphic graphs up to
N = 8. Combined with a sampling technique, our heuristic is used to
identify extended motifs of a protein interaction network. We sample
both connected and disconnected subgraphs uniformly by picking N
distinct nodes at random. Motifs are then discovered by looking at the
significance - with some caveats - of individual subgraphs that contain
these structures as distinct pieces.



B. Summary 

The labelling algorithm is described in Section II. In Section III
various stages of the algorithm are tested. The full algorithm
successfully distinguishes all graphs with up to eight nodes.
Differences in the running times and accuracy of the stages are also
discussed. In Section IV the algorithm is used to identify extended
motifs and antimotifs in the E. coli protein interaction network. The
motifs all share a remarkably similar bipartite structure, which is
completely different from the long chains and tails seen in antimotifs.
For fixed N the distribution of all subgraph counts is found to obey a
power law, where the exponent depends on N. However, the Zipf plots of
the real and randomised networks are quite similar although the
subgraphs exchange rank. In Section V we conclude with a summary.



II. THE LABELLING ALGORITHM 

The algorithm developed here can be applied to both 
simple graphs and digraphs - graphs with directed edges. 
Here we will concentrate on the algorithm for simple 
graphs, leaving the straightforward generalisation to di- 
graphs to a later publication. 

Motif discovery requires a fast way to identify graphs that are
isomorphic. One way to be certain that two graphs are isomorphic is to
find the isomorphism that maps one to the other. This is a permutation
of the vertex labels of one graph such that its list of links becomes
identical to that of the other graph. To show that two graphs are not
isomorphic therefore requires proving that no such isomorphism exists,
which in theory requires checking every possible permutation of the
vertices. Since there are N! such permutations for a graph with N nodes,
this is far too time-consuming to be practical.

FIG. 1: Two non-isomorphic graphs that cannot be distinguished by any
of the invariants proposed by Remie.

Many algorithms therefore start by trying to reduce the number of
permutations that need to be checked, usually by applying some kind of
"canonical labelling" or ordering to the vertices. For example, if a
unique way of ordering the vertices in both graphs can be found, then
vertices of the same rank must map to each other - in order for the
graphs to be isomorphic.
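To make the cost of the permutation search concrete, the following sketch tests isomorphism by brute force over all N! relabellings. It is an illustration only, not part of the labelling heuristic; the function name and the small example graphs are assumptions introduced here.

```python
from itertools import permutations

def are_isomorphic_bruteforce(n, edges_a, edges_b):
    """Try every relabelling of graph A's n vertices (n! of them) and
    report whether one of them reproduces graph B's list of links."""
    set_b = {frozenset(e) for e in edges_b}
    if len({frozenset(e) for e in edges_a}) != len(set_b):
        return False  # different numbers of links: cannot be isomorphic
    for perm in permutations(range(n)):
        mapped = {frozenset((perm[u], perm[v])) for u, v in edges_a}
        if mapped == set_b:
            return True
    return False

# Hypothetical 4-node examples: a chain, the same chain relabelled, a star.
path = [(0, 1), (1, 2), (2, 3)]
relabelled_path = [(2, 0), (0, 3), (3, 1)]
star = [(0, 1), (0, 2), (0, 3)]
```

For N = 4 only 24 permutations are tried, but the loop length grows as N!, which is why a label that can be computed once per subgraph is preferable for a census.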

An alternative approach is to try to find an invariant under
permutation, or set of invariants, that uniquely labels any graph. The
use of invariants ensures that isomorphic graphs always receive
identical labels. However, it is not certain that non-isomorphic graphs
will receive at least one different label. Remie [43] defines four
different invariants, but none of these can distinguish the eight node
graphs in Fig. 1 as non-isomorphic.



A. Invariant Vertex Labels 

Our approach defines vertex invariants through a generalisation of
standard canonical labelling. Usually,
the canonical label depends only on the degrees of the 
vertex being labelled together with its immediate neigh- 
bours. This means, for example, that all vertices in a long 
chain (except the two endpoints) receive the same label, 
whereas it is clear that nodes near the end of the chain 
should be distinguishable from nodes nearer the middle. 
Bearing this in mind, we have extended the usual canon- 
ical labelling to include all vertices in the graph. In the 
case of a graph made of disconnected pieces, we include 
all vertices in the connected piece containing the vertex 
being labelled. 

As with usual canonical labelling, our label is a sum of powers of two,
with the vertex degrees, $k_j$, determining the power. To include all
vertices, but give a higher weight to those closest to the vertex $v_i$
being labelled, we include an additional factor of $2^{x - x_{ij}}$,
where $x$ is the diameter. This diameter is the maximum shortest path
between any two vertices on the connected piece of the subgraph
containing $v_i$. The quantity $x_{ij}$ is the distance between vertices
$v_i$ and $v_j$, where $v_j$ is required to be connected to $v_i$ by
some path. The lowest possible weighting is $2^0 = 1$ (if $x_{ij} = x$),
and the highest weighting ($2^x$) is given to $v_i$ itself. Each vertex
$v_i$ is assigned a label $X_i$ as follows:

$$X_i = \sum_{j}^{\rm connected} 2^{x - x_{ij} + k_j} , \qquad (1)$$

where $k_j$ is the degree of vertex $v_j$. The sum is taken over all
vertices in the graph, or if the graph contains several disjoint
subgraphs, over all vertices in the connected subgraph containing $v_i$.

FIG. 2: Vertex labels calculated using Eq. (1). Higher values are
assigned to more central vertices, or those with higher degrees.

The labels defined by Eq. (1) have an intuitive meaning. More connected
or central vertices have higher values. Fig. 2 gives some examples of
the labelling scheme for different subgraphs. The labels $X_i$ are
clearly higher for more central vertices than those closer to the edge.
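A minimal sketch of Eq. (1) is given below, assuming the graph is stored as a dictionary of neighbour sets and that distances are found by breadth-first search. The function names are chosen here for illustration and are not taken from the original implementation.

```python
from collections import deque

def distances(adj, source):
    """Breadth-first shortest-path lengths from source to every vertex
    in the same connected piece."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def vertex_labels(adj):
    """X_i = sum_j 2**(x - x_ij + k_j), where j runs over the connected
    piece containing i and x is the diameter of that piece."""
    dist = {v: distances(adj, v) for v in adj}
    labels = {}
    for i in adj:
        diameter = max(max(dist[j].values()) for j in dist[i])
        labels[i] = sum(2 ** (diameter - d + len(adj[j]))
                        for j, d in dist[i].items())
    return labels

# A 4-node chain 0-1-2-3: the two middle vertices receive the higher label.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
chain_labels = vertex_labels(chain)
```

For the chain, the end vertices evaluate to 16 + 16 + 8 + 2 = 42 and the middle vertices to 32 + 8 + 16 + 4 = 60, in line with the statement that more central vertices have higher values.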



B. Invariant Graph Labels 

The set of vertex labels could be used directly to distinguish graphs,
but they would need to be ordered, for instance in descending order,
before comparisons between graphs could be made. Another approach is to
combine the vertex labels to obtain a small set of graph labels. One
candidate graph label is the sum $l'_1 = \sum_i X_i$. Unfortunately it
does not produce unique labels. Fig. 3 shows two graphs that have the
same sum despite having different vertex labels (and hence being clearly
non-isomorphic). However, the product does not suffer from this defect.
In theory it could, but in practice we have not found it to be the case
for the graphs studied.

FIG. 3: These two graphs have different vertex labels $X_i$, which
nonetheless combine to give the same sum: $l'_1$ = 68 + 68 + 56 + 52 +
52 = 64 + 64 + 56 + 56 + 56 = 296. Their graph labels $l_1$, however,
are not equal.

Our first graph label is therefore defined to be

$$l_1 = \prod_i X_i . \qquad (2)$$

Note that this product is over all the vertices in the graph, whether it
is connected or made of disjoint pieces. Should this product become too
large to be conveniently stored as an integer, the first several (e.g.
9) digits can be used instead, without causing any degeneracy in labels.
Again, this is an empirical observation rather than a mathematical
certainty. However, this is not the end of the story.

We found that $l_1$ successfully distinguishes all graphs with up to
five nodes, but there are two pairs of non-isomorphic graphs with six
nodes that are assigned identical values. The graphs in Fig. 1 provide
another problematic example. These graphs are highly symmetric. In both
graphs, every vertex has degree five, with the two remaining nodes at
distance $x_{ij} = 2$. Hence all the labels,
$X_i = 2^2 \cdot 2^5 + 5 \cdot 2^1 \cdot 2^5 + 2 \cdot 2^0 \cdot 2^5 = 512$,
are identical.

If all vertices are equally "connected", but the two graphs are not
isomorphic, what is the difference between them? Taking their
complements (exchanging links and non-links for every vertex pair) as
shown in Fig. 4 reveals the source. While the complement of graph (a) is
a single loop with 8 links, which we shall now refer to as an 8-loop,
that of graph (b) consists instead of two 4-loops. Applying our
labelling method to these complements produces unique labels, which
suggests a possible solution to the problem. For all graphs, first
calculate $l_1$ as described above. Then take the complement of each
connected subgraph of the graph. Recalculate labels, $Y_i$, for this new
graph, and combine these labels into the product

$$l_2 = \prod_i Y_i , \qquad (3)$$

where the product is again taken over all vertices in the graph. Each
graph is then labelled by the vector $(l_1, l_2, L)$, where L is the
total number of links in the graph. Note that for disjoint graphs, it is
extremely important to take the complement of each connected subgraph
individually; if the complement of the whole graph is taken instead,
small disconnected pieces can cause problems, so that degeneracy in
labelling appears for quite small graphs.

FIG. 4: These two graphs are the complements of the graphs shown in
Fig. 1.

An algorithm with these graph labels was tested by applying it to every
possible labelled graph for N ≤ 8 and measuring the number of distinct
sets of labels. This number was compared to the true number of
non-isomorphic graphs, as determined using Pólya's enumeration theorem.
The algorithm uniquely labelled every graph with up to six nodes
(N = 6), distinguished 1038 out of 1044 for N = 7 and 12078 out of 12346
for N = 8. Even for N = 8 almost 98% of distinct graphs were uniquely
labelled.
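The label vector $(l_1, l_2, L)$ can be sketched as follows. This is a hedged illustration under the same assumptions as before (dictionary of neighbour sets, breadth-first distances); the helper names, and the construction of two Fig. 1-style graphs as complements of an 8-loop and of two 4-loops, are introduced here for the example.

```python
from collections import deque
from math import prod

def distances(adj, source):
    """Breadth-first shortest-path lengths within source's connected piece."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def vertex_labels(adj):
    """Eq. (1), evaluated separately on each connected piece."""
    dist = {v: distances(adj, v) for v in adj}
    labels = {}
    for i in adj:
        diameter = max(max(dist[j].values()) for j in dist[i])
        labels[i] = sum(2 ** (diameter - d + len(adj[j]))
                        for j, d in dist[i].items())
    return labels

def complement_by_piece(adj):
    """Exchange links and non-links inside each connected piece only."""
    return {v: set(distances(adj, v)) - adj[v] - {v} for v in adj}

def graph_label(adj):
    """Return (l1, l2, L): Eq. (2), Eq. (3) and the number of links."""
    l1 = prod(vertex_labels(adj).values())
    l2 = prod(vertex_labels(complement_by_piece(adj)).values())
    n_links = sum(len(nb) for nb in adj.values()) // 2
    return l1, l2, n_links

def whole_complement(adj):
    """Ordinary complement, used here only to build the example graphs."""
    return {v: set(adj) - adj[v] - {v} for v in adj}

ring8 = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
two_squares = {i: {4 * (i // 4) + (i - 1) % 4, 4 * (i // 4) + (i + 1) % 4}
               for i in range(8)}
graph_a, graph_b = whole_complement(ring8), whole_complement(two_squares)
label_a, label_b = graph_label(graph_a), graph_label(graph_b)
```

In this sketch `label_a` and `label_b` share the same first and third entries but differ in the second, which is the behaviour the complement label is meant to provide.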

What further invariant properties can be used as labels? Again,
considering the complements in Fig. 4 provides a clue - their different
loop structures. In fact the numbers of all loops except 3-loops are
different for the two graphs in Fig. 1. We counted all the loops in a
graph by searching through its adjacency matrix. The number of 3-loops
($n_3$), 4-loops ($n_4$) etc. can then be incorporated as extra labels,
so that each graph is labelled by the vector
$(l_1, l_2, L, n_3, n_4, \ldots, n_N)$. This adapted algorithm, when
tested, correctly distinguished all graphs with up to N = 8 nodes.
Exhaustive testing of graphs with more nodes is not worthwhile at
present, as the program for N = 9 would run for more than a year on a
present day standard laptop.
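One simple way to obtain the loop counts $n_3, \ldots, n_N$ is a depth-first search over simple paths, sketched below. This is a substitute technique chosen for brevity, not the adjacency-matrix search used in the study; integer vertex names are assumed so that each loop can be anchored at its smallest vertex.

```python
def count_loops(adj):
    """Return {k: number of simple loops with k links} for k = 3..N.
    Each loop is found from its smallest vertex, once per direction."""
    nodes = sorted(adj)
    counts = {k: 0 for k in range(3, len(nodes) + 1)}

    def extend(start, current, visited, length):
        # 'length' is the number of vertices on the current path.
        for w in adj[current]:
            if w == start and length >= 3:
                counts[length] += 1          # path closes into a loop
            elif w > start and w not in visited:
                visited.add(w)
                extend(start, w, visited, length + 1)
                visited.remove(w)

    for s in nodes:
        extend(s, s, {s}, 1)
    # Every loop was traversed clockwise and anticlockwise.
    return {k: c // 2 for k, c in counts.items()}

ring8 = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
two_squares = {i: {4 * (i // 4) + (i - 1) % 4, 4 * (i // 4) + (i + 1) % 4}
               for i in range(8)}
loops_ring = count_loops(ring8)
loops_squares = count_loops(two_squares)
```

The search is exponential in the worst case, which is consistent with the observation that loop counting dominates the running time of the full algorithm.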



III. TESTING THE ALGORITHM 

This section may be skipped by those primarily interested in motif
discovery. As stated in Section II, all stages of the algorithm have
been tested exhaustively for graphs with up to eight nodes. A simple
graph with N nodes contains $L_{max} = \binom{N}{2} = N(N-1)/2$ vertex
pairs. Thus $L_{max}$ is the maximum possible number of links, and
$2^{L_{max}}$ is the number of labelled graphs. An easy way to generate
all labelled graphs is to cycle through the binary numbers between 0 and
$2^{L_{max}} - 1$, loading their digits in order into the off-diagonal
elements of an adjacency matrix. The labelling algorithm can then be
successively applied to each matrix or graph. The accuracy of the
algorithm can be evaluated by comparing the number of graphs correctly
distinguished to the true number of non-isomorphic graphs, as determined
by Pólya's enumeration theorem. The results for different stages of the
algorithm are shown in Table I. Note that since the labels are
invariants, isomorphic graphs must be assigned the same set of labels.
Thus it is not possible to overcount the number of distinct graphs.
Undercounting is possible, however, since non-isomorphic graphs may
nonetheless have similar enough structures to produce degenerate labels.
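The exhaustive test can be sketched as below. To keep the example short, a deliberately weak stand-in invariant (the sorted degree sequence) is used in place of the graph labels of Section II; it is an assumption made for illustration, and it shows the undercounting described above. The function names are hypothetical.

```python
from itertools import combinations

def all_labelled_graphs(n):
    """Cycle through the integers 0 .. 2**L_max - 1 and load their binary
    digits into the L_max = n(n-1)/2 vertex pairs."""
    pairs = list(combinations(range(n), 2))
    for code in range(2 ** len(pairs)):
        adj = {v: set() for v in range(n)}
        for bit, (u, v) in enumerate(pairs):
            if (code >> bit) & 1:
                adj[u].add(v)
                adj[v].add(u)
        yield adj

def count_distinct(n, invariant):
    """Number of distinct label values over all labelled n-node graphs."""
    return len({invariant(adj) for adj in all_labelled_graphs(n)})

def degree_sequence(adj):
    """Stand-in invariant: never overcounts, but may undercount."""
    return tuple(sorted(len(nb) for nb in adj.values()))

distinct_4 = count_distinct(4, degree_sequence)
distinct_5 = count_distinct(5, degree_sequence)
```

With this weak invariant all 11 non-isomorphic graphs on four nodes are still separated, but for five nodes fewer than the 34 non-isomorphic graphs are found: a 5-node chain and a triangle plus a separate link, for example, share the same degree sequence.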

Table I shows that incorporating loop counting together with $l_1$ and
$l_2$ is the most accurate method. However, the cost in computing time
is significant. On a standard laptop, for N = 8 it took four and a half
hours to compute $l_1$ alone, six hours to compute $l_1$ and $l_2$, and
twenty six hours for the full algorithm including loop counting. Using
$l_1$ and $l_2$ without loop counting works perfectly up to N = 6, but
then misses 6 graphs (0.6%) at N = 7 and 268 graphs (2.2%) at N = 8. The
graphs shown in Fig. 5 are typical examples of pairs not distinguished
by either $l_1$ or $l_2$. One graph can be mapped to the other by
switching the endpoints or "rewiring" two links. The complements of the
graphs share the same property; hence the degeneracy in $l_2$ as well as
$l_1$.

Another possible route might be to omit $l_2$ when loop counting is
included. Using $l_1$ plus loop counting works perfectly up to N = 7,
but fails to distinguish two pairs of graphs at N = 8 (see Fig. 6). The
danger, as with omitting loop counting, is that once an algorithm misses
even a small percentage of graphs for some N, it misses more and more as
N increases.

To summarise: the combination of $l_1$, $l_2$ and loop enumeration
differentiates all non-isomorphic graphs with up to eight nodes.
However, loop counting is very time consuming, and omitting it only
causes around 2% of the N = 8 graphs to be degenerately labelled. With
the above mentioned caveats we proceed with a subgraph census obtained
by sampling a protein interaction network using the algorithm with
$l_1$ and $l_2$, but without loops.

TABLE I: Number of graphs distinguished by different graph labels
compared to the exact number of graphs calculated using Pólya's
enumeration theorem, shown in the second column. The third column shows
the result obtained by using the sum, $l'_1$, rather than a product,
$l_1$, of the vertex labels. In the remaining columns $l_1$ and $l_2$
are as defined in Equations (2) and (3). The last column includes the
number of loops as graph labels.

FIG. 5: The bottom pair of graphs are the complements of the top pair.
Neither pair can be distinguished by $l_1$ or $l_2$. The pairs exhibit
different loop structures and can therefore be differentiated by loop
enumeration.

FIG. 6: The top and bottom pairs have the same vertex labels and the
same loop structure. Both pairs are distinguished by the labels of their
complements. Note that the bottom pair are identical to the top - save
for the addition of one extra link.
FIG. 7: Zipf plot for N = 9 subgraphs of the giant component of the
E. coli network, for sample size 10*. Also shown are the Zipf plots for
the rewired network and for a Bernoulli or Erdős-Rényi (ER) random graph
with the same link probability and same sample size. Fixing the degree
sequence almost exactly fixes the Zipf plot while the specific subgraphs
exchange rank under rewiring.



IV. ENSEMBLES OF SUBGRAPHS AND 
MOTIF DETECTION IN A PROTEIN 
INTERACTION NETWORK 

We now present results for the statistics of subgraphs in the protein
interaction network of E. coli. The histogram of all non-isomorphic
subgraphs in the network is a characterisation of that network. This is
termed a "subgraph census". The ensemble of subgraphs is obtained by
uniform sampling rather than exact enumeration. This should give an
accurate picture of the true census up to statistical fluctuations and
an overall normalisation. Uniform sampling of connected and disconnected
N-node subgraphs is achieved by picking N nodes at random. Results were
compared with exact enumeration for small N. Since there is no inherent
directionality in the interactions themselves, we have chosen to treat
the network as undirected. The network has 270 nodes and 716 links;
however, it is not fully connected: seventeen pairs of nodes connect
only to each other, and there are two isolated triplets. The largest
connected component consists of 230 nodes and 695 links. Both this
piece, termed the giant component (GC), and the entire network are
studied.



A. Zipf's Law for Subgraph Census

We first consider subgraphs with a fixed number of nodes and ask what
is the frequency of occurrence of different subgraphs. For each N ≥ 5 a
sample of 10^ subgraphs was obtained. The ensembles for N = 3 and N = 4
do not have enough subgraphs to obtain a smooth distribution. The labels
L, $l_1$ and $l_2$ were used to identify graphs, but loop counting was
not included.

FIG. 8: Zipf plots obtained from the giant component of the E. coli
network for subgraphs with varying numbers of nodes N.

The subgraphs were then ranked in descending order of occurrence, and
Zipf plots were made [47].
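The ranking step can be sketched as follows. The least-squares slope on log-log axes is an assumption added here as one simple way to summarise a Zipf plot; the text does not state how exponents were estimated, and the function names are hypothetical.

```python
import math
from collections import Counter

def zipf_ranking(labels):
    """Occurrence counts of each distinct label, in descending order.
    Position r (starting from 1) in the list is the rank."""
    return sorted(Counter(labels).values(), reverse=True)

def loglog_slope(ranked_counts):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(r) for r in range(1, len(ranked_counts) + 1)]
    ys = [math.log(c) for c in ranked_counts]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical sample of graph labels, one entry per sampled subgraph.
sampled = ["a", "b", "a", "c", "a", "b"]
ranked = zipf_ranking(sampled)
```

A pure power law, count proportional to rank to the power -1, would give a slope of -1 in this sketch; a shallower Zipf plot gives a slope closer to zero.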

The Zipf plots all indicate power law behaviour. Figure 7 shows a
typical example. The distribution obtained from the GC was compared to
two different null cases. The first, denoted "randomised" in Fig. 7, is
a rewired version of the GC with the degree of each node fixed. This was
generated by repeatedly choosing two links in the network at random and
swapping their endpoints, until mixing was achieved. As usual, mixing
was evaluated a posteriori. Swaps are disallowed if they create
self-loops or produce a pre-existing link. The second null model is a
random Erdős-Rényi (ER) network with the same link probability as the
real network. For the GC of the E. coli network, this link probability
is $p = 695/\binom{230}{2} \approx 0.0264$; for the original network
$p = 716/\binom{270}{2} \approx 0.0197$. An ensemble of 10^ graphs with
the desired number of nodes was generated using a Bernoulli process. In
particular, a random number was placed on each pair of distinct nodes to
determine whether or not a link would be made. This ensemble is denoted
"ER" in the Zipf plots. As demonstrated in Fig. 7, the Zipf plots of the
real and randomised networks are almost identical, but differ noticeably
from the ER network. This is true for all N, and for both the GC and the
entire E. coli network.
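Both null models can be sketched compactly. The code below is an illustration under stated assumptions: links are stored as pairs of integers, the number of swaps is left to the caller (mixing is not assessed), and the function names are introduced here.

```python
import random
from itertools import combinations
from math import comb

def rewire(edges, n_swaps, rng=random):
    """Degree-preserving randomisation: pick two links, swap endpoints,
    and reject swaps that make a self-loop or duplicate an existing link."""
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c                      # choose one of the two pairings
        if a == d or c == b:
            continue                         # would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue                         # would duplicate a link
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

def link_probability(n_nodes, n_links):
    """Fraction of the n(n-1)/2 vertex pairs that carry a link."""
    return n_links / comb(n_nodes, 2)

def er_graph(n, p, rng=random):
    """Bernoulli process: one random number per pair of distinct nodes."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

ring = [(i, (i + 1) % 10) for i in range(10)]
shuffled = rewire(ring, 20, random.Random(0))
p_gc = link_probability(230, 695)
```

Every accepted swap removes one link from each of the four nodes involved and gives one back, so the degree sequence is unchanged by construction.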

Figure 8 shows Zipf plots for the GC with varying subgraph sizes. It
can be seen that all five sizes are consistent with power law behaviour,
although N = 6 is less smooth than the others because there are fewer
subgraphs. The main difference between the Zipf plots is that as N
increases the gradient becomes shallower. Hence, it appears that the
exponent is not universal with respect to N.

Zipf plots for the original network and its GC are also similar. As
Fig. 9 shows, the plots for the real network and the randomised network
with identical degree sequence are close in both cases. The main
difference between the GC and the entire network is that in the latter
case the distribution is somewhat broader. However, the curves for the
ER networks with corresponding link probabilities show the same
tendency, which suggests that the difference in link probability may be
the main factor for this trend.

FIG. 9: Zipf plots for N = 11 subgraphs in different networks: (a) real
E. coli network, (b) rewired E. coli network, (c) ER network with same
number of nodes and link probability as (a), (d) giant component of
E. coli network, (e) rewired giant component, and (f) ER network with
same number of nodes and link probability as (d).



B. Evidence for Motifs 

Although the collections of subgraph counts are almost identical for
the real and randomised networks, the rank of individual subgraphs
within each census differs markedly. The subgraphs of the randomised
network were arranged in the same order as those in the real network to
get the scatter plot shown in Fig. 10. For comparison, the Zipf plot for
the real network is also shown as a connected line. The vertical
difference between each point and the line indicates the difference in
the number of occurrences of a particular subgraph in the randomised
network as compared to the original one. Note that the rank of the
subgraph in the original network gives a unique tag to that subgraph. It
can clearly be seen that the counts of certain individual subgraphs vary
by orders of magnitude between the two networks.

These large differences are not just a statistical artifact of the
rewiring process, as can be seen by redoing Fig. 10 for two randomised
networks with the same degree sequence as the E. coli network. Now the
subgraphs are ordered according to their occurrence in the first
randomised network. Comparing Figs. 10 and 11, note that the scatter of
points around the line (particularly below the line) in the latter case
is significantly less than the former. This suggests the existence of
"motifs": particular subgraphs that are significantly over-abundant in
the real network compared to its ensemble of randomised networks.

FIG. 10: Occurrences of N = 14 subgraphs for the real (red line) and
randomised (black points) networks. The subgraphs in the randomised
network are placed along the x-axis in the same order as those in the
real network to allow direct comparison between counts for each
subgraph. Points significantly below the line represent motifs, while
those significantly above represent anti-motifs.

FIG. 11: This graph compares two randomised versions of the E. coli
network in exactly the same way that the network and a randomised
version of it were compared in Fig. 10. The fluctuations, or scatter
below and above the line in Fig. 10, are much larger, indicating a
pattern of statistically significant deviations of subgraph occurrences
in the original network.

To explore the issue of motifs further, subgraph counts from the real
network were compared to counts from several randomised networks. For
N = 3 and N = 4, we made an exhaustive enumeration of every subgraph.
This was done for the real network and one hundred different randomised
networks. The mean and standard deviation of the randomised counts were
then computed, allowing a Z-score to be calculated. Fig. 12 shows the
results for the original network (with Z-scores for the GC in brackets).
The counts in the ER column are theoretical expectation values for an ER
network of the appropriate size and link probability.
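The Z-score computation can be written out as below. The use of the sample standard deviation is an assumption made here (the text does not say which estimator was used), and the counts in the example are invented purely for illustration.

```python
from statistics import mean, stdev

def z_score(real_count, randomised_counts):
    """(real count - ensemble mean) / ensemble standard deviation,
    using the sample standard deviation of the randomised counts."""
    mu = mean(randomised_counts)
    sigma = stdev(randomised_counts)
    if sigma == 0:
        raise ValueError("randomised counts do not vary; Z-score undefined")
    return (real_count - mu) / sigma

# Hypothetical counts of one subgraph in three randomised networks.
example = z_score(14, [8, 10, 12])
```

A large positive value marks a candidate motif and a large negative value a candidate anti-motif, subject to the caveat that Z-scores of different subgraphs can be strongly correlated.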
C. Linear Constraints Between Subgraphs

For N = 3 the Z-scores all have the same magnitude. This is a direct
consequence of the strict conservation of the degree sequence in the
rewiring procedure [46]. Consider a particular swap between two links.
The only 3-node graphs that can possibly be affected are those that
contain at least one of the newly created or newly deleted links. At
least two of the three nodes must therefore be chosen from among the
four at the ends of the swapped links. The 4-node graph formed by the
swapping nodes themselves is always unchanged by all allowed swaps
(recall that it is not permitted to duplicate a pre-existing link). Its
3-node subgraphs are therefore also unaffected. The remaining
possibility involves 5-node graphs containing one extra node in addition
to the four swapping nodes. This extra node can have between zero and
four links connecting it to the four swapping nodes. It turns out that
there are only three pairs of 5-node graphs that can be interchanged by
link swapping. In every case, the count of N = 3 graphs with no links
decreases by one, that of one-link graphs increases by three, that of
two-link graphs decreases by three and that of three-link graphs
increases by one (up to an overall sign). This exact equality produces
coincidence of the Z-scores.
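The fixed pattern of changes can be checked numerically on a small example. The sketch below counts 3-node subgraphs by their number of links before and after a single allowed swap; the 6-node graph and the chosen swap are hypothetical and serve only as an illustration.

```python
from itertools import combinations

def triple_census(n, edges):
    """Counts of 3-node induced subgraphs with 0, 1, 2 and 3 links."""
    present = {frozenset(e) for e in edges}
    census = [0, 0, 0, 0]
    for trio in combinations(range(n), 3):
        links = sum(frozenset(pair) in present
                    for pair in combinations(trio, 2))
        census[links] += 1
    return census

# A triangle (0, 1, 2) plus a chain 3-4-5.
before = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5)]
# Swap the endpoints of links 0-1 and 3-4: they become 0-4 and 3-1.
after = [(0, 4), (3, 1), (1, 2), (2, 0), (4, 5)]

delta = [a - b for a, b in zip(triple_census(6, after),
                               triple_census(6, before))]
```

Here the swap destroys the triangle, and the four counts change by +1, -3, +3 and -1, which is the pattern quoted above with the overall sign reversed.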

The only remaining degree of freedom for the deviation 
of the actual network from its randomised ensemble is a 
single signed number. Its value indicates a significant dif- 
ference between the real network and random networks 
with the same degree sequence, although it is impossible 
to ascribe this significance to any one subgraph in partic- 
ular. Note that for the empty 3-node graph, its statistical 
under-abundance in the real network is due to the fact 
that the variance of this number in the ensemble is tiny, 
because those changes are slaved to a variable (the con- 
nected triangle) with small numbers. The actual under- 
abundance of empty 3-node subgraphs is an unimportant 
fraction of the overall number of those subgraphs. This 
illustrates the potential difficulties with assigning
importance to individual subgraphs based on their
individual Z-scores when the Z-scores must be correlated.

Conservation rules for subgraphs under rewiring were
previously observed in [25] for 3-node subgraphs in
directed networks, where although there are thirteen
different connected motifs, only seven degrees of freedom
are independent. For undirected N-node graphs, there
are N conservation laws corresponding to moments of
k_i^m with m = 0, 1, ..., N - 1. Since there are four
distinct undirected graphs for N = 3 and eleven for
N = 4, there is only one independent degree of freedom
for N = 3, while for N = 4 there are seven.
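
The counting behind these numbers can be reproduced by brute force. The sketch below is only an illustration: it uses exact canonical labelling over all permutations (feasible for very small N, and not the heuristic developed in this paper) to count the distinct undirected N-node graphs, then subtracts the N conservation laws:

```python
from itertools import combinations, permutations

def count_graphs(n):
    """Number of non-isomorphic undirected graphs on n nodes (brute force)."""
    pairs = list(combinations(range(n), 2))
    seen = set()
    for mask in range(1 << len(pairs)):
        edges = [p for i, p in enumerate(pairs) if mask >> i & 1]
        # Canonical form: the smallest relabelled edge list over all permutations.
        canon = min(
            tuple(sorted(tuple(sorted((perm[a], perm[b]))) for a, b in edges))
            for perm in permutations(range(n))
        )
        seen.add(canon)
    return len(seen)

# Independent degrees of freedom = number of graph types minus N constraints.
free = {n: count_graphs(n) - n for n in (3, 4)}
```

This gives one free parameter for N = 3 and seven for N = 4, in agreement with the text.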



D. Motif Selection 

Ignoring the potential problems associated with attaching
physical importance to specific subgraphs with high
individual Z-scores, we find that for N = 4 two graphs
stand out as being particularly over- or under-abundant.
The square graph labelled 1679616 is over-represented,
while the same graph with one edge missing (labelled
6350400) is under-represented. It is also interesting to
note that graphs with more (fewer) links tend towards
over- (under-)abundance. Overall the Z-scores are
modestly lowered for the GC, but the same overall trends
emerge in both cases. In particular, the same two
subgraphs are readily identified as motif and anti-motif.

For N ≥ 5, an exhaustive scan of all subgraphs is
time-consuming, so uniform samples of 10^ subgraphs were
used instead. Subgraphs do not need to be fully connected
in order to be useful for identifying motifs. As for
N = 3 and N = 4, the real network was compared to an
ensemble of randomised networks with the same degree
sequence. Only 20 networks were included, though, rather
than 100. Twenty was chosen as the smallest number for
which standard deviations and Z-scores are reasonably
stable. Checks show that when calculations are repeated,
the Z-scores obtained vary slightly, but the same graphs
always stand out as motifs.

The main difficulty is that too many subgraphs have
high individual Z-scores. This may be related to the
correlations discussed above. Ignoring previously
mentioned caveats, we proceeded by using other selection
criteria to choose the most important. After some
experimentation the following ad hoc rules were used to
identify motifs. Two different samples were taken from
the real network, and Z-scores were computed comparing
each of these to the same ensemble of 20 randomised
networks. A subgraph was identified as either a motif (if
it was connected) or containing a motif (if it was
disconnected) if Z > 10 (or Z < -10 for anti-motifs) for
both samples. Note that we only consider connected pieces
to be motifs even though the subgraphs from which motifs
are identified may be disconnected. Requiring |Z| > 10
for two different samples largely eliminates statistical
oddities, which can otherwise occur for subgraphs with
low counts. The relatively high cut-off in Z also helps
ensure statistical stability, as was also noted in [S^l].
Even then, the number of new motifs identified increases
dramatically with N. To overcome this problem, only
subgraphs whose Z-scores were in the top fifty for that
value of N were considered. Again, this had to be true
for both samples. Motifs identified at a given N tend to
reappear as connected components in disconnected graphs
at higher N (see, for example, the graph labelled 4096(1)
for N = 3 and N = 4 in Fig. 12). The last condition was
therefore that a new motif has to replace an old one in
the top fifty to make the grade.
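
The rules above can be summarised in a short sketch. The function and argument names are hypothetical, and two readings of the text are assumptions: "top fifty" is taken as the fifty largest Z-scores for motifs (the fifty most negative for anti-motifs), and "replacing an old motif" is taken as having a label not already identified at smaller N:

```python
def select_motifs(z_a, z_b, known, z_cut=10.0, top=50):
    """Apply the ad hoc rules to Z-scores from two samples (dicts: label -> Z)."""
    def top_labels(z, sign):
        # Rank by Z for motifs (sign=+1) or by -Z for anti-motifs (sign=-1).
        ranked = sorted(z, key=lambda lab: sign * z[lab], reverse=True)
        return set(ranked[:top])

    result = {"motif": [], "antimotif": []}
    for name, sign in (("motif", 1), ("antimotif", -1)):
        # A candidate must be in the top list for both samples.
        shortlist = top_labels(z_a, sign) & top_labels(z_b, sign)
        for lab in sorted(shortlist):
            passes = sign * z_a[lab] > z_cut and sign * z_b[lab] > z_cut
            # Labels already found at smaller N are not counted again.
            if passes and lab not in known:
                result[name].append(lab)
    return result

# Made-up Z-scores for illustration only.
sample_a = {"a": 25.0, "b": 12.0, "c": -30.0, "d": 3.0}
sample_b = {"a": 22.0, "b": 8.0, "c": -15.0, "d": 11.0}
picked = select_motifs(sample_a, sample_b, known=set())
```

Here only "a" passes as a motif and "c" as an anti-motif, because "b" and "d" each exceed the cut-off in only one of the two samples.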

Since including extra, unconnected nodes does not change
the label of a graph, it is easy to identify and
eliminate previous motifs at each new value of N. Motifs
with a given number of nodes are not always discovered
straight away; for example an N = 6 motif may not meet
the condition Z > 10 in the sample of N = 6 subgraphs
but show up much more strongly (with one or two
disconnected nodes) at N = 7 or N = 8. This often means
that subgraphs which only just fail the criteria at one N
are positively identified at the next. This trend makes
the selection of motifs more robust against small changes
in the rules used to identify motifs. At some point,
however, the number of genuine new motifs found begins to
account for a smaller and smaller proportion of newly
identified subgraphs. We also found that for N > 9 a
smaller proportion of sampled subgraphs had |Z| > 10.
Because of these diminishing returns, the present search
was stopped after N = 10.

There are several possible reasons for this loss of
efficiency: one is the finite size of the E. coli
network, or another property of the network. It is also
possible that the heuristic may be starting to fail,
recalling that the most accurate version was not employed
because of time constraints. Wrongly classifying a small
percentage of nonisomorphic graphs as isomorphic is
unlikely to make much difference, but if the problem
worsened, genuine motifs could be swamped by other
subgraphs which are more common in the randomised
networks. This potential difficulty does not cast doubt
on the motifs or antimotifs presented here, as none of
them fall into the categories of graphs that cause
problems, which have been thoroughly investigated for
N ≤ 8. However, further investigation might be
appropriate before attempting to use this method for much
larger subgraphs.

The original network was considered first, then
calculations were repeated on the GC of the network for
the first few N. The same motifs were identified for both
networks, although the order in which they were found
varied slightly. We therefore conclude that the technique
is robust.



E. Patterns in Motifs 

The motifs found are shown in Figs. 13 (over-abundant)
and 14 (under-abundant). Some striking patterns appear.
First, many of the motifs have a bipartite structure
where the vertices can be divided into two sets such that
no links exist within either set, but many links exist
between members of opposite sets. Many graphs display a
complete matching: each vertex is connected to every
member of the other set. Many more graphs have almost
complete matchings, missing just one or two links. Again,
some graphs are almost bipartite, with complete or almost
complete matchings between two sets of vertices, and just
a few links within each set. Some of these latter graphs
may be seen as interpolating between bipartite graphs and
complete graphs, where every vertex is connected to every
other vertex. Complete graphs at N = 4 and N = 5 are
observed as motifs. All motifs have a high link:node
ratio. In fact, L ≥ N for all motifs. Finally, the
remaining motifs fall into one of the categories
described above, with the addition of one or two
"hanging" links.

Antimotifs follow a completely different pattern. They
occur mostly as trees or may contain at most a single
loop (usually a triangle, but there are two pentagons and
one square) with long tails. This is to be contrasted
with the bipartite structures of the over-abundant
subgraphs, which typically contain many loops. Antimotifs
also have fewer links than motifs (either L = N - 1 for
pure chains or L = N if there is one loop). This
difference in the link:node ratios is readily apparent in
Table II. In fact, for a given N no overlap in L values
for motifs and antimotifs exists.
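
The structural categories used in this section depend only on N, L and a two-colouring test, so they are straightforward to check for any connected piece. The following sketch (function name and output keys are illustrative assumptions) classifies a connected graph given as a list of links:

```python
from collections import deque

def describe(n, edges):
    """Classify a connected graph on nodes 0..n-1 by the patterns in the text."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    links = sum(len(s) for s in adj.values()) // 2
    # Two-colour the graph by breadth-first search to test bipartiteness.
    colour = {0: 0}
    queue = deque([0])
    bipartite = True
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in colour:
                colour[w] = 1 - colour[v]
                queue.append(w)
            elif colour[w] == colour[v]:
                bipartite = False
    if len(colour) != n:
        raise ValueError("graph must be connected")
    return {
        "links": links,
        "tree": links == n - 1,         # connected with L = N - 1
        "single_loop": links == n,      # connected with L = N
        "bipartite": bipartite,
        "complete": links == n * (n - 1) // 2,
    }

square = describe(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
chain = describe(4, [(0, 1), (1, 2), (2, 3)])
```

For the two N = 4 graphs singled out earlier, the square is reported as bipartite with a single loop (L = N), and the square with one edge removed is reported as a tree (L = N - 1).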

V. SUMMARY 

This paper addresses some of the problems associated
with finding extended structures in complex networks.
We propose a new heuristic for graph isomorphism and
validate its accuracy for classifying all undirected
subgraphs with N up to 8. A version of the algorithm is
used, together with uniform sampling, to obtain
statistical signatures of the ensemble of N-node
subgraphs in an E. coli protein interaction network for
subgraphs with N up to 14. The distribution of subgraph
occurrences follows a power law and the Zipf plots do not
change significantly under rewiring. Sampling all
possible subgraphs for various N allows for the discovery
of extended motifs. Motifs are considered to be
individual, connected graphs that are vastly
over-represented in the network compared to a null model.
They have more edges, for a given number of nodes, than
antimotifs and generally display a bipartite structure or
tend towards a complete graph. In contrast, antimotifs
are mostly trees or contain at most a single, small loop.
The heuristic for graph isomorphism developed here can be
applied with minor changes to directed graphs.

TABLE II: The number of motifs (bold) and antimotifs (italics)
with a given number of nodes, N, and links, L. The two
classes are separated in this space, and do not overlap.

FIG. 12: Results for subgraphs with N = 3 and N = 4 nodes in the E. coli protein interaction network. The third column
shows the counts obtained by exact enumeration for the real network, while columns 4-6 show results obtained from exact
enumeration of subgraphs in an ensemble of 100 networks with the same degree sequence. Standard deviations for the giant
component are shown in brackets in column 6. The last column shows theoretical expectation values for ER random graphs.




FIG. 13: Motifs (over-abundant subgraphs) of the E. coli protein interaction network.